{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "127a3a72-05e4-4336-b0ee-4856bf481248",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-06-22T02:03:38.004105Z",
     "iopub.status.busy": "2024-06-22T02:03:38.003588Z",
     "iopub.status.idle": "2024-06-22T02:03:38.007672Z",
     "shell.execute_reply": "2024-06-22T02:03:38.007017Z",
     "shell.execute_reply.started": "2024-06-22T02:03:38.004078Z"
    }
   },
   "source": [
    "# 简介\n",
    "\n",
    "在开始之前,我们可以先认识一下什么是 IPEX-LLM, IPEX-LLM是一个PyTorch库，用于在Intel CPU和GPU（例如，具有iGPU的本地PC,Arc、Flex和Max等独立GPU）上以非常低的延迟运行LLM.总而言之我们可以利用它加快大语言模型在 intel 生态设备上的运行速度;无需额外购买其他计算设备,我们可以高速率低消耗的方式在本地电脑上运行大语言模型.\n",
    "\n",
    "在本次比赛的第一篇教程中,我们就能掌握 IPEX-LLM 的基本使用方法,我们将利用 IPEX-LLM 加速 Qwen2 语言模型的运行,跟随这篇 notebook 一步步仔细操作,我们可以简单快速的掌握大语言模型在 intel 硬件上的高性能部署.\n",
    "\n",
    "# 一、安装环境\n",
    "\n",
    "在开始运行推理之前，我们需要准备好运行 qwen2 需要的必须环境，此时请确保你进入的镜像是 `ubuntu22.04-py310-torch2.1.2-tf2.14.0-1.14.0` 否则将会看到找不到 conda 文件夹的报错，切记。\n",
    "\n",
    "你将在终端运行下列脚本,进行 ipex-llm 的正式 conda 环境的恢复，恢复完成后关闭所有开启的 notebook 窗口，然后重新打开，才能正常切换对应 kernel。\n",
    "\n",
    "那么，什么是 kernel 呢？简单理解，它用于提供 python\n",
    "代码运行所需的所有支持，而会把我们的消息发送到对应的 kernel 进行执行。你可以在 notebook 右上角看到 Python3(ipykernel) 的字样，它代表默认环境的内核；我们可以通过在对应虚拟环境启动 jupyter notebook 使用对应虚拟环境的内核环境，也可以使用类似 `python3 -m ipykernel install --name=ipex` 的指令将某个虚拟环境（在这里是 ipex）注册到 notebook 的可使用内核中。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90956f90-fc56-4e3b-bb23-b99fb63afeb3",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:14:11.745942Z",
     "iopub.status.busy": "2024-06-25T14:14:11.745358Z",
     "iopub.status.idle": "2024-06-25T14:14:11.829346Z",
     "shell.execute_reply": "2024-06-25T14:14:11.828095Z",
     "shell.execute_reply.started": "2024-06-25T14:14:11.745896Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile /mnt/workspace/install.sh\n",
    "# 切换到 conda 的环境文件夹\n",
    "cd  /opt/conda/envs \n",
    "mkdir ipex\n",
    "# 下载 ipex-llm 官方环境\n",
    "wget https://s3.idzcn.com/ipex-llm/ipex-llm-2.1.0b20240410.tar.gz \n",
    "# 解压文件夹以便恢复原先环境\n",
    "tar -zxvf ipex-llm-2.1.0b20240410.tar.gz -C ipex/ && rm ipex-llm-2.1.0b20240410.tar.gz\n",
    "# 安装 ipykernel 并将其注册到 notebook 可使用内核中\n",
    "/opt/conda/envs/ipex/bin/python3 -m pip install ipykernel && /opt/conda/envs/ipex/bin/python3 -m ipykernel install --name=ipex"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78a75102-fb8f-4492-926d-393155c5482f",
   "metadata": {},
   "source": [
    "当你运行完上面的代码块后，此时会在 `/mnt/workspace` 目录下创建名为 `install.sh` 名字的 bash 脚本，你需要打开终端，执行命令 `bash install.sh` 运行 bash 脚本，等待执行完毕后关闭所有的 notebook 窗口再重新打开，直到你在右上角点击 `Python3 (ipykernel)` 后可以看到名为 `ipex` 的环境，点击后切换即可进入到 `ipex-llm` 的正式开发环境，你也可以在终端中执行 `conda activate ipex` 启动 ipex 的虚拟环境，至此准备工作完成。"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd7b278f-49a7-4ee8-bc79-834622fa2753",
   "metadata": {},
   "source": [
    "# 二、模型准备\n",
    "\n",
    "Qwen2是阿里云最新推出的开源大型语言模型系列，相比Qwen1.5，Qwen2实现了整体性能的代际飞跃，大幅提升了代码、数学、推理、指令遵循、多语言理解等能力。\n",
    "\n",
    "包含5个尺寸的预训练和指令微调模型：Qwen2-0.5B、Qwen2-1.5B、Qwen2-7B、Qwen2-57B-A14B和Qwen2-72B，其中Qwen2-57B-A14B为混合专家模型（MoE）。所有尺寸模型都使用了GQA（分组查询注意力）机制，以便让用户体验到GQA带来的推理加速和显存占用降低的优势。\n",
    "\n",
    "在中文、英语的基础上，训练数据中增加了27种语言相关的高质量数据。增大了上下文长度支持，最高达到128K tokens（Qwen2-72B-Instruct）。\n",
    "\n",
    "在这里，我们将使用 `Qwen/Qwen2-1.5B-Instruct` 的模型参数版本来体验 Qwen2 的强大能力。\n",
    "\n",
    "首先，我们需要对模型进行下载，我们可以通过 modelscope 的 api 很容易实现模型的下载：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ca434513-e0b1-42f4-bdbf-21e3ed9a2af6",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:30:14.889188Z",
     "iopub.status.busy": "2024-06-25T14:30:14.888805Z",
     "iopub.status.idle": "2024-06-25T14:30:35.840622Z",
     "shell.execute_reply": "2024-06-25T14:30:35.839943Z",
     "shell.execute_reply.started": "2024-06-25T14:30:14.889158Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-08-26 17:47:57,763 - modelscope - INFO - PyTorch version 2.2.2 Found.\n",
      "2024-08-26 17:47:57,764 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer\n",
      "2024-08-26 17:47:57,765 - modelscope - INFO - No valid ast index found from /root/.cache/modelscope/ast_indexer, generating ast index from prebuilt!\n",
      "2024-08-26 17:47:57,800 - modelscope - INFO - Loading done! Current index file version is 1.13.3, with md5 efd8926a2280ac10b2165c5552dcfaa1 and a total number of 972 components indexed\n",
      "/root/anaconda3/envs/ipex/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "Downloading: 100%|██████████| 660/660 [00:00<00:00, 198kB/s]\n",
      "Downloading: 100%|██████████| 48.0/48.0 [00:00<00:00, 20.5kB/s]\n",
      "Downloading: 100%|██████████| 242/242 [00:00<00:00, 211kB/s]\n",
      "Downloading: 100%|██████████| 11.1k/11.1k [00:00<00:00, 9.95MB/s]\n",
      "Downloading: 100%|██████████| 1.59M/1.59M [00:00<00:00, 7.86MB/s]\n",
      "Downloading:  76%|███████▌  | 2.19G/2.88G [03:34<01:06, 11.1MB/s]2024-08-26 17:51:37,651 - modelscope - WARNING - Downloading: model.safetensors failed, reason: ('Connection broken: IncompleteRead(152397187 bytes read, 15374973 more expected)', IncompleteRead(152397187 bytes read, 15374973 more expected)) will retry\n",
      "Downloading: 100%|█████████▉| 2.88G/2.88G [04:20<00:00, 11.8MB/s]\n",
      "Downloading: 100%|██████████| 3.47k/3.47k [00:00<00:00, 2.52MB/s]\n",
      "Downloading: 100%|██████████| 6.70M/6.70M [00:00<00:00, 21.1MB/s]\n",
      "Downloading: 100%|██████████| 1.26k/1.26k [00:00<00:00, 1.21MB/s]\n",
      "Downloading: 100%|██████████| 2.65M/2.65M [00:00<00:00, 12.7MB/s]\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "from modelscope import snapshot_download, AutoModel, AutoTokenizer\n",
    "import os\n",
    "# 第一个参数表示下载模型的型号，第二个参数是下载后存放的缓存地址，第三个表示版本号，默认 master\n",
    "model_dir = snapshot_download('Qwen/Qwen2-1.5B-Instruct', cache_dir='../model/qwen2chat_src', revision='master')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d647d5a-a675-4ddf-8cd6-ccabd0b5140e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2024-06-22T10:59:26.623781Z",
     "iopub.status.busy": "2024-06-22T10:59:26.623253Z",
     "iopub.status.idle": "2024-06-22T10:59:26.627138Z",
     "shell.execute_reply": "2024-06-22T10:59:26.626571Z",
     "shell.execute_reply.started": "2024-06-22T10:59:26.623758Z"
    }
   },
   "source": [
    "下载完成后，我们将对 qwen2 模型进行低精度量化至 int4 ，低精度量化（Low Precision Quantization）是指将浮点数转换为低位宽的整数（这里是int4），以减少计算资源的需求和提高系统的效率。这种技术在深度学习模型中尤其重要，它可以在硬件上实现快速、低功耗的推理，也可以加快模型加载的速度。\n",
    "\n",
    "经过 Intel ipex-llm 优化后的大模型加载 api `from ipex_llm.transformers import AutoModelForCausalLM`， 我们可以很容易通过 `load_in_low_bit='sym_int4'` 将模型量化到 int4 ，英特尔 IPEX-LLM 支持 ‘sym_int4’, ‘asym_int4’, ‘sym_int5’, ‘asym_int5’ 或 'sym_int8’选项，其中 ‘sym’ 和 ‘asym’ 用于区分对称量化与非对称量化。 最后，我们将使用 `save_low_bit` api 将转换后的模型权重保存到指定文件夹。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5b7946f4-ad7e-4feb-bca9-6a44ca7bfe15",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:30:35.842493Z",
     "iopub.status.busy": "2024-06-25T14:30:35.841990Z",
     "iopub.status.idle": "2024-06-25T14:31:26.282108Z",
     "shell.execute_reply": "2024-06-25T14:31:26.281361Z",
     "shell.execute_reply.started": "2024-06-25T14:30:35.842461Z"
    },
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "/root/model/../model/qwen2chat_src/Qwen/Qwen2-1___5B-Instruct\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2024-08-26 17:53:41,282 - INFO - Converting the current model to sym_int4 format......\n",
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    }
   ],
   "source": [
    "from ipex_llm.transformers import AutoModelForCausalLM\n",
    "from transformers import  AutoTokenizer\n",
    "import os\n",
    "if __name__ == '__main__':\n",
    "    model_name = \"qwen2\"\n",
    "    model_path = os.path.join(os.getcwd(),\"../model/qwen2chat_src/Qwen/Qwen2-1___5B-Instruct\")\n",
    "\n",
    "    print(model_path)\n",
    "    model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit='sym_int4', trust_remote_code=True)\n",
    "    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\n",
    "    model.save_low_bit(f'../model/{model_name}_int4')\n",
    "    tokenizer.save_pretrained(f'../model/{model_name}_int4')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a2285dc-cf33-4ffb-a498-5d67c82b4806",
   "metadata": {},
   "source": [
    "准备完转换后的量化权重，接下来我们将在终端中第一次运行 qwen2 在 CPU 上的大模型推理，但请注意不要在 notebook 中运行（本地运行可以在 notebook 中运行，由于魔搭 notebook 和终端运行脚本有一些区别，这里推荐在终端中运行。\n",
    "\n",
    "在运行下列代码块后，将会自动在终端中新建一个python文件，我们只需要在终端运行这个python文件即可启动推理：\n",
    "\n",
    "```python\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "python3 run.py\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0590d50a-7c85-4674-8db8-acdbaa758efd",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:31:26.283430Z",
     "iopub.status.busy": "2024-06-25T14:31:26.282988Z",
     "iopub.status.idle": "2024-06-25T14:31:26.288508Z",
     "shell.execute_reply": "2024-06-25T14:31:26.287890Z",
     "shell.execute_reply.started": "2024-06-25T14:31:26.283413Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile /mnt/workspace/run.py\n",
    "# 导入必要的库\n",
    "import os\n",
    "# 设置OpenMP线程数为8,优化CPU并行计算性能\n",
    "os.environ[\"OMP_NUM_THREADS\"] = \"8\"\n",
    "import torch\n",
    "import time\n",
    "from ipex_llm.transformers import AutoModelForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "# 指定模型加载路径\n",
    "load_path = \"qwen2chat_int4\"\n",
    "# 加载低位(int4)量化模型,trust_remote_code=True允许执行模型仓库中的自定义代码\n",
    "model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)\n",
    "# 加载对应的分词器\n",
    "tokenizer = AutoTokenizer.from_pretrained(load_path, trust_remote_code=True)\n",
    "\n",
    "# 定义输入prompt\n",
    "prompt = \"给我讲一个芯片制造的流程\"\n",
    "\n",
    "# 构建符合模型输入格式的消息列表\n",
    "messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "\n",
    "# 使用推理模式,减少内存使用并提高推理速度\n",
    "with torch.inference_mode():\n",
    "    # 应用聊天模板,将消息转换为模型输入格式的文本\n",
    "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    # 将文本转换为模型输入张量,并移至CPU (如果使用GPU,这里应改为.to('cuda'))\n",
    "    model_inputs = tokenizer([text], return_tensors=\"pt\").to('cpu')\n",
    "\n",
    "    st = time.time()\n",
    "    # 生成回答,max_new_tokens限制生成的最大token数\n",
    "    generated_ids = model.generate(model_inputs.input_ids,\n",
    "                                   max_new_tokens=512)\n",
    "    end = time.time()\n",
    "\n",
    "    # 初始化一个空列表,用于存储处理后的generated_ids\n",
    "    processed_generated_ids = []\n",
    "\n",
    "    # 使用zip函数同时遍历model_inputs.input_ids和generated_ids\n",
    "    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):\n",
    "        # 计算输入序列的长度\n",
    "        input_length = len(input_ids)\n",
    "        \n",
    "        # 从output_ids中截取新生成的部分\n",
    "        # 这是通过切片操作完成的,只保留input_length之后的部分\n",
    "        new_tokens = output_ids[input_length:]\n",
    "        \n",
    "        # 将新生成的token添加到处理后的列表中\n",
    "        processed_generated_ids.append(new_tokens)\n",
    "\n",
    "    # 将处理后的列表赋值回generated_ids\n",
    "    generated_ids = processed_generated_ids\n",
    "\n",
    "    # 解码模型输出,转换为可读文本\n",
    "    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
    "    \n",
    "    # 打印推理时间\n",
    "    print(f'Inference time: {end-st:.2f} s')\n",
    "    # 打印原始prompt\n",
    "    print('-'*20, 'Prompt', '-'*20)\n",
    "    print(text)\n",
    "    # 打印模型生成的输出\n",
    "    print('-'*20, 'Output', '-'*20)\n",
    "    print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99c5eb2b-e2c8-41f8-8ffa-c03968d63722",
   "metadata": {},
   "source": [
    "在上面的代码中，我们演示的是等到结果完全输出后再打印的模式，但有聪明的同学肯定好奇，有什么方法能够让我们及时看到输出的结果？这里我们介绍一种新的输出模式——流式输出，流式的意思顾名思义就是输出是不断流动的，也就是不停的向外输出的。通过流式输出，我们可以很容易及时看到模型输出的结果。在 transformers 中，我们将会使用 `TextStreamer` 组件来实现流式输出，记得这个 python 文件同样需要在终端执行：\n",
    "```python\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "python3 run_stream.py\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29a82251-bd77-4f9b-ac7c-00edb1ccdb6e",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:31:26.289842Z",
     "iopub.status.busy": "2024-06-25T14:31:26.289623Z",
     "iopub.status.idle": "2024-06-25T14:31:28.208607Z",
     "shell.execute_reply": "2024-06-25T14:31:28.207735Z",
     "shell.execute_reply.started": "2024-06-25T14:31:26.289826Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile /mnt/workspace/run_stream.py\n",
    "# 设置OpenMP线程数为8\n",
    "import os\n",
    "os.environ[\"OMP_NUM_THREADS\"] = \"8\"\n",
    "\n",
    "import time\n",
    "from transformers import AutoTokenizer\n",
    "from transformers import TextStreamer\n",
    "\n",
    "# 导入Intel扩展的Transformers模型\n",
    "from ipex_llm.transformers import AutoModelForCausalLM\n",
    "import torch\n",
    "\n",
    "# 加载模型路径\n",
    "load_path = \"qwen2chat_int4\"\n",
    "\n",
    "# 加载4位量化的模型\n",
    "model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)\n",
    "\n",
    "# 加载对应的tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(load_path, trust_remote_code=True)\n",
    "\n",
    "# 创建文本流式输出器\n",
    "streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n",
    "\n",
    "# 设置提示词\n",
    "prompt = \"给我讲一个芯片制造的流程\"\n",
    "\n",
    "# 构建消息列表\n",
    "messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "    \n",
    "# 使用推理模式\n",
    "with torch.inference_mode():\n",
    "\n",
    "    # 应用聊天模板,添加生成提示\n",
    "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    \n",
    "    # 对输入文本进行编码\n",
    "    model_inputs = tokenizer([text], return_tensors=\"pt\")\n",
    "    \n",
    "    print(\"start generate\")\n",
    "    st = time.time()  # 记录开始时间\n",
    "    \n",
    "    # 生成文本\n",
    "    generated_ids = model.generate(\n",
    "        model_inputs.input_ids,\n",
    "        max_new_tokens=512,  # 最大生成512个新token\n",
    "        streamer=streamer,   # 使用流式输出\n",
    "    )\n",
    "    \n",
    "    end = time.time()  # 记录结束时间\n",
    "    \n",
    "    # 打印推理时间\n",
    "    print(f'Inference time: {end-st} s')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85226459-6228-4c7f-9f1f-c9d7d95d97f3",
   "metadata": {},
   "source": [
    "恭喜你，你已经完全掌握了如何应用英特尔 ipex-llm 工具在 CPU 上实现 qwen2 大模型高性能推理。至此已掌握了完整输出 / 生成流式输出的调用方法；接下来我们讲更进一步，通过 Gradio 实现一个简单的前端来与我们在 cpu 上部署后的大模型进行对话，并实现流式打印返回结果。\n",
    "\n",
    "Gradio 是一个开源的 Python 库，用于快速构建机器学习和数据科学演示应用。它使得开发者可以在几行代码中创建一个简单、可调整的用户界面，用于展示机器学习模型或数据科学工作流程。Gradio 支持多种输入输出组件，如文本、图片、视频、音频等，并且可以轻松地分享应用，包括在互联网上分享和在局域网内分享.\n",
    "\n",
    "简单来说,利用 Graio 库,我们可以很容易实现一个具有对话功能的前端页面.\n",
    "\n",
    "注意! 在运行之前,我们需要安装 gradio 库依赖环境, 你需要在终端执行:\n",
    "\n",
    "```bash\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "pip install gradio\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cd73a31-0feb-4fd7-ad73-9e086725a620",
   "metadata": {},
   "source": [
    "需要强调的是，在运行之前我们还需要对启动命令进行修改才能正常使用 gradio 前端, 我们可以看到最后一句 gradio 的启动命令 ` demo.launch(root_path='/dsw-525085/proxy/7860/')` ,但每个人对应的不都是 dsw-525085,也许是 dsw-233333, 这取决于此时你的网页 url 链接上显示的地址是否是类似 `https://dsw-gateway-cn-hangzhou.data.aliyun.com/dsw-525085/` 的字眼,根据你显示 url 的对应数字不同,你需要把下面的 gradio 代码 root_path 中的 dsw标识修改为正确对应的数字,才能在运行后看到正确的 gradio 页面.\n",
    "\n",
    "在修改完 root_path 后,我们可以在终端中顺利运行 gradio 窗口:\n",
    "```python\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "python3 run_gradio_stream.py\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "24eb15d1-5e5a-4468-a1b4-8925f9bdf772",
   "metadata": {
    "ExecutionIndicator": {
     "show": true
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:31:41.182248Z",
     "iopub.status.busy": "2024-06-25T14:31:41.181784Z",
     "iopub.status.idle": "2024-06-25T14:31:41.187666Z",
     "shell.execute_reply": "2024-06-25T14:31:41.187145Z",
     "shell.execute_reply.started": "2024-06-25T14:31:41.182228Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile /mnt/workspace/run_gradio_stream.py\n",
    "import gradio as gr\n",
    "import time\n",
    "import os\n",
    "from transformers import AutoTokenizer, TextIteratorStreamer\n",
    "from ipex_llm.transformers import AutoModelForCausalLM\n",
    "import torch\n",
    "from threading import Thread, Event\n",
    "\n",
    "# 设置环境变量\n",
    "os.environ[\"OMP_NUM_THREADS\"] = \"8\"  # 设置OpenMP线程数为8,用于控制并行计算\n",
    "\n",
    "# 加载模型和tokenizer\n",
    "load_path = \"qwen2chat_int4\"  # 模型路径\n",
    "model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)  # 加载低位模型\n",
    "tokenizer = AutoTokenizer.from_pretrained(load_path, trust_remote_code=True)  # 加载对应的tokenizer\n",
    "\n",
    "# 将模型移动到GPU（如果可用）\n",
    "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")  # 检查是否有GPU可用\n",
    "model = model.to(device)  # 将模型移动到选定的设备上\n",
    "\n",
    "# 创建 TextIteratorStreamer，用于流式生成文本\n",
    "streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n",
    "\n",
    "# 创建一个停止事件，用于控制生成过程的中断\n",
    "stop_event = Event()\n",
    "\n",
    "# 定义用户输入处理函数\n",
    "def user(user_message, history):\n",
    "    return \"\", history + [[user_message, None]]  # 返回空字符串和更新后的历史记录\n",
    "\n",
    "# 定义机器人回复生成函数\n",
    "def bot(history):\n",
    "    stop_event.clear()  # 重置停止事件\n",
    "    prompt = history[-1][0]  # 获取最新的用户输入\n",
    "    messages = [{\"role\": \"user\", \"content\": prompt}]  # 构建消息格式\n",
    "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # 应用聊天模板\n",
    "    model_inputs = tokenizer([text], return_tensors=\"pt\").to(device)  # 对输入进行编码并移到指定设备\n",
    "    \n",
    "    print(f\"\\n用户输入: {prompt}\")\n",
    "    print(\"模型输出: \", end=\"\", flush=True)\n",
    "    start_time = time.time()  # 记录开始时间\n",
    "\n",
    "    # 设置生成参数\n",
    "    generation_kwargs = dict(\n",
    "        model_inputs,\n",
    "        streamer=streamer,\n",
    "        max_new_tokens=512,  # 最大生成512个新token\n",
    "        do_sample=True,  # 使用采样\n",
    "        top_p=0.7,  # 使用top-p采样\n",
    "        temperature=0.95,  # 控制生成的随机性\n",
    "    )\n",
    "\n",
    "    # 在新线程中运行模型生成\n",
    "    thread = Thread(target=model.generate, kwargs=generation_kwargs)\n",
    "    thread.start()\n",
    "\n",
    "    generated_text = \"\"\n",
    "    for new_text in streamer:  # 迭代生成的文本流\n",
    "        if stop_event.is_set():  # 检查是否需要停止生成\n",
    "            print(\"\\n生成被用户停止\")\n",
    "            break\n",
    "        generated_text += new_text\n",
    "        print(new_text, end=\"\", flush=True)\n",
    "        history[-1][1] = generated_text  # 更新历史记录中的回复\n",
    "        yield history  # 逐步返回更新的历史记录\n",
    "\n",
    "    end_time = time.time()\n",
    "    print(f\"\\n\\n生成完成，用时: {end_time - start_time:.2f} 秒\")\n",
    "\n",
    "# 定义停止生成函数\n",
    "def stop_generation():\n",
    "    stop_event.set()  # 设置停止事件\n",
    "\n",
    "# 使用Gradio创建Web界面\n",
    "with gr.Blocks() as demo:\n",
    "    gr.Markdown(\"# Qwen 聊天机器人\")\n",
    "    chatbot = gr.Chatbot()  # 聊天界面组件\n",
    "    msg = gr.Textbox()  # 用户输入文本框\n",
    "    clear = gr.Button(\"清除\")  # 清除按钮\n",
    "    stop = gr.Button(\"停止生成\")  # 停止生成按钮\n",
    "\n",
    "    # 设置用户输入提交后的处理流程\n",
    "    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(\n",
    "        bot, chatbot, chatbot\n",
    "    )\n",
    "    clear.click(lambda: None, None, chatbot, queue=False)  # 清除按钮功能\n",
    "    stop.click(stop_generation, queue=False)  # 停止生成按钮功能\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    print(\"启动 Gradio 界面...\")\n",
    "    demo.queue()  # 启用队列处理请求\n",
    "    demo.launch(root_path='/dsw-525085/proxy/7860/')  # 兼容魔搭情况下的路由"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fa1a6122-3e52-41e4-9168-a8de05af22dd",
   "metadata": {},
   "source": [
    "# 在运行之前,我们需要安装 gradio 库依赖环境\n",
    "当然,除了 gradio 之外,我们还有另一款流行的 python 前端开源库可以方便我们的大模型对话应用,它的名字叫 Streamlit, 简单来说, Streamlit是一个Python库，用于快速构建交互式Web应用程序。它提供了一个简单的API，允许开发者使用Python代码来创建Web应用程序，而无需学习复杂的Web开发技术. 这听上去是不是与 gradio 差不多? 你可以选择自己喜欢的一款前端库来完成对应 AI  应用的开发,具体细节可以参考它的官方网站 https://streamlit.io/, 在这里,我们可以跑一个最简单的聊天界面来体验 gradio 与 Streamlit 开发与体验的不同之处.\n",
    "\n",
    "注意, 在运行之前,我们需要安装 streamlit 库依赖环境\n",
    "\n",
    "```bash\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "pip install streamlit\n",
    "```\n",
    "\n",
    "```python\n",
    "cd /mnt/workspace\n",
    "conda activate ipex\n",
    "streamlit run run_streamlit_stream.py\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb3ccf46-86c6-4695-93f9-fabff5b9dcbf",
   "metadata": {
    "ExecutionIndicator": {
     "show": false
    },
    "execution": {
     "iopub.execute_input": "2024-06-25T14:31:41.188948Z",
     "iopub.status.busy": "2024-06-25T14:31:41.188483Z",
     "iopub.status.idle": "2024-06-25T14:31:41.193615Z",
     "shell.execute_reply": "2024-06-25T14:31:41.193194Z",
     "shell.execute_reply.started": "2024-06-25T14:31:41.188928Z"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%writefile /mnt/workspace/run_streamlit_stream.py\n",
    "\n",
    "# 导入操作系统模块，用于设置环境变量\n",
    "import os\n",
    "# 设置环境变量 OMP_NUM_THREADS 为 8，用于控制 OpenMP 线程数\n",
    "os.environ[\"OMP_NUM_THREADS\"] = \"8\"\n",
    "\n",
    "# 导入时间模块\n",
    "import time\n",
    "# 导入 Streamlit 模块，用于创建 Web 应用\n",
    "import streamlit as st\n",
    "# 从 transformers 库中导入 AutoTokenizer 类\n",
    "from transformers import AutoTokenizer\n",
    "# 从 transformers 库中导入 TextStreamer 类\n",
    "from transformers import TextStreamer\n",
    "# 从 ipex_llm.transformers 库中导入 AutoModelForCausalLM 类\n",
    "from ipex_llm.transformers import AutoModelForCausalLM\n",
    "# 导入 PyTorch 库\n",
    "import torch\n",
    "\n",
    "# 指定模型路径\n",
    "load_path = \"qwen2chat_int4\"\n",
    "# 加载低比特率模型\n",
    "model = AutoModelForCausalLM.load_low_bit(load_path, trust_remote_code=True)\n",
    "# 从预训练模型中加载 tokenizer\n",
    "tokenizer = AutoTokenizer.from_pretrained(load_path, trust_remote_code=True)\n",
    "\n",
    "# 定义回调函数，用于逐步更新生成的文本\n",
    "def stream_callback(text, message_placeholder, full_response):\n",
    "    # 更新完整响应\n",
    "    full_response += text\n",
    "    # 在 Streamlit 界面上显示带有光标的响应\n",
    "    message_placeholder.markdown(full_response + \"▌\")\n",
    "    return full_response\n",
    "\n",
    "# 定义生成响应函数\n",
    "def generate_response(prompt, message_placeholder):\n",
    "    # 将用户的提示转换为消息格式\n",
    "    messages = [{\"role\": \"user\", \"content\": prompt}]\n",
    "    # 应用聊天模板并进行 token 化\n",
    "    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
    "    model_inputs = tokenizer([text], return_tensors=\"pt\")\n",
    "    \n",
    "    # 创建 TextStreamer 对象，跳过提示和特殊标记\n",
    "    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)\n",
    "    \n",
    "    # 初始化一个空字符串用于存储完整响应\n",
    "    full_response = \"\"\n",
    "    # 定义回调函数\n",
    "    def callback(text):\n",
    "        nonlocal full_response\n",
    "        # 调用 stream_callback 函数更新响应\n",
    "        full_response = stream_callback(text, message_placeholder, full_response)\n",
    "        \n",
    "    # 设置 TextStreamer 的回调函数\n",
    "    streamer.on_text_chunk = callback\n",
    "    \n",
    "    # 使用推理模式生成响应\n",
    "    with torch.inference_mode():\n",
    "        generated_ids = model.generate(\n",
    "            model_inputs.input_ids,\n",
    "            max_new_tokens=512,\n",
    "            streamer=streamer,\n",
    "        )\n",
    "\n",
    "    # 初始化一个空列表，用于存储处理后的 generated_ids\n",
    "    processed_generated_ids = []\n",
    "\n",
    "    # 使用 zip 函数同时遍历 model_inputs.input_ids 和 generated_ids\n",
    "    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):\n",
    "        # 计算输入序列的长度\n",
    "        input_length = len(input_ids)\n",
    "        \n",
    "        # 从 output_ids 中截取新生成的部分\n",
    "        # 这是通过切片操作完成的，只保留 input_length 之后的部分\n",
    "        new_tokens = output_ids[input_length:]\n",
    "        \n",
    "        # 将新生成的 token 添加到处理后的列表中\n",
    "        processed_generated_ids.append(new_tokens)\n",
    "\n",
    "    # 将处理后的列表赋值回 generated_ids\n",
    "    generated_ids = processed_generated_ids\n",
    "    \n",
    "    # 解码生成的 token 为文本\n",
    "    response_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)\n",
    "    \n",
    "    return response_text\n",
    "\n",
    "# Streamlit 应用部分\n",
    "# 设置应用标题\n",
    "st.title(\"大模型聊天应用\")\n",
    "\n",
    "# 初始化聊天历史，如果不存在则创建一个空列表\n",
    "if \"messages\" not in st.session_state:\n",
    "    st.session_state.messages = []\n",
    "\n",
    "# 显示聊天历史\n",
    "for message in st.session_state.messages:\n",
    "    with st.chat_message(message[\"role\"]):\n",
    "        st.markdown(message[\"content\"])\n",
    "\n",
    "# 用户输入部分\n",
    "if prompt := st.chat_input(\"你想说点什么?\"):\n",
    "    # 将用户消息添加到聊天历史\n",
    "    st.session_state.messages.append({\"role\": \"user\", \"content\": prompt})\n",
    "    with st.chat_message(\"user\"):\n",
    "        st.markdown(prompt)\n",
    "\n",
    "    # 创建空的占位符用于显示生成的响应\n",
    "    with st.chat_message(\"assistant\"):\n",
    "        message_placeholder = st.empty()\n",
    "        \n",
    "        # 调用模型生成响应\n",
    "        response = generate_response(prompt, message_placeholder)\n",
    "        \n",
    "        # 在占位符中显示响应\n",
    "        message_placeholder.markdown(response)\n",
    "    \n",
    "    # 将助手的响应添加到聊天历史\n",
    "    st.session_state.messages.append({\"role\": \"assistant\", \"content\": response})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02d42fac-80f2-4ee4-85ab-9859ae557a80",
   "metadata": {},
   "source": [
    "至此,你已经完全入门 IPEX-LLM 对大语言模型的部署工程,但 LLM 部署只是第一步,基于 LLM 的应用才是关键,对于这样一款能在端侧上运行的大模型推理优化系统,你会用他优化后的大模型做些什么有趣的大模型原生应用? 期待你的想法能创造出令人惊叹的AI作品.\n",
    "\n",
    "祝你好运!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ipex",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.19"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
